2024-10-07
Successes or failures; 1/0; binary variables.
A Bernoulli variable is a binary variable, observed over a series of \(n\) trials, that takes on a value of 1 with probability \(\theta\) and 0 with probability \(1-\theta\).
The Binomial distribution is the associated probability mass function (PMF) for the number of successes, \(k\), in \(n\) trials:
\[ f\left(k\right) = \binom{n}{k} \theta^k\left(1-\theta\right)^{n-k} \]
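A quick illustration using R's built-in binomial functions; the choice of 10 trials with \(\theta = 0.5\) is arbitrary, for the example only:
# Draw 10 Bernoulli trials with theta = 0.5
rbinom(n = 10, size = 1, prob = 0.5)
# Probability of exactly 5 successes in 10 trials
dbinom(5, size = 10, prob = 0.5)
[1] 0.2460938
# Probability of 4 or fewer successes in 10 trials
pbinom(4, size = 10, prob = 0.5)
[1] 0.3769531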
\(P(D|\theta)\). This is the likelihood of observing the data, given the parameter
\(P(\theta)\). This is the prior distribution of the parameter
\(P(D)\). This is the marginal probability of the data; the normalizing constant
\(P(\theta | D)\). This is the posterior distribution of the parameter, given the data
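These components combine through Bayes' theorem:
\[ P(\theta | D) = \frac{P(D|\theta)\,P(\theta)}{P(D)} \]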
Possible values of theta: 0, 0.01, 0.02, 0.03, 0.04, 0.05 ...
Assume we observe 902 heads, 341 tails. What's our best guess?
The maximum value of the distribution is at theta = 0.73
A noninformative prior
Every value of \(\theta\) is equally probable; a uniform density
A uniform prior on the interval \([0,1]\) means any value of \(\theta\) on this interval is as likely as any other value
grid_values <- seq(0, 1, by = 0.01)
# Define the prior, likelihood
prior <- rep(1, length(grid_values)) # All values equally probable
likelihood <- dbinom(902, 1243, grid_values) # binomial
# Calculate the posterior
posterior <- likelihood * prior
posterior <- posterior / sum(posterior) # p(D)
data <- data.frame(grid_values = grid_values, posterior = posterior)
# Calculate the 95% credible interval
cumulative_posterior <- cumsum(posterior)
lower_bound <- grid_values[which(cumulative_posterior >= 0.025)[1]] # find bounds
upper_bound <- grid_values[which(cumulative_posterior >= 0.975)[1]]
The 95% credible interval is from 0.7 to 0.75
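As a check on the grid approximation: with a uniform (Beta(1, 1)) prior and a binomial likelihood, the posterior has the closed form Beta(1 + 902, 1 + 341), so the interval can also be computed directly:
qbeta(c(0.025, 0.975), shape1 = 903, shape2 = 342)
# approximately 0.70 and 0.75, matching the grid-based interval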
# Sample 100 values from the posterior and take the mean
sample(grid_values, 100, prob = posterior, replace = TRUE) %>% mean()
[1] 0.7229
\[ \log p(D|\theta)=\sum_{n=1}^N\left[y_n \log\theta+(1-y_n)\log(1-\theta)\right] \] \[ \log p(D|\theta) \equiv LL(\theta) \]
Multiplying many probabilities together gives exceedingly small (or, with densities, large) numbers that can overwhelm floating-point precision; summing logs avoids this, as the sketch below illustrates
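A minimal illustration, using 3,000 hypothetical Bernoulli observations (the sample size and \(\theta = 0.7\) are invented for the example):
# 3,000 hypothetical Bernoulli(0.7) observations
y <- rbinom(3000, 1, 0.7)
# The raw likelihood product underflows to 0
prod(dbinom(y, 1, 0.7))
# The log-likelihood remains finite and usable
sum(dbinom(y, 1, 0.7, log = TRUE))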
The “unknown” parameter in these trials is \(\theta\), estimated by \(\hat{\theta}\): the probability of success, \(p(y=1)\)
We could use either the Bernoulli or Binomial densities
We can express \(\hat{\theta}\) as some function of predictor variables
\[\hat{\theta}_i=a+bx_i+\epsilon_i\]
Because \(y\) is binary, its conditional variance depends on \(x\), so the linear probability model is inherently heteroskedastic: \[var(y|x)=pr(y=1|x)(1-pr(y=1|x))=xb(1-xb)\]
Functional form. The linear model assumes a constant effect of \(x\rightarrow y\)
This may be unlikely, particularly when dealing with a probability. We may expect the effect of \(x\) on \(y\) to be non-linear, so that the effect of \(x\) on \(y\) is not constant
The linear model, by contrast, assumes a constant marginal effect:
\[
\frac{\partial y}{\partial x} = b
\]
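A small sketch of the contrast, comparing a linear model's constant marginal effect with the varying effect implied by a logistic model (the slope values 0.1 and 1 are arbitrary, for illustration):
# Linear model: the marginal effect is b everywhere
b <- 0.1
# Logistic model: the marginal effect is b * p * (1 - p), which varies with x
x <- c(-4, -2, 0, 2, 4)
p <- plogis(1 * x)  # illustrative slope of 1, intercept 0
data.frame(x = x, linear_effect = b, logistic_effect = 1 * p * (1 - p))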
\[ x_{obs} \rightarrow y_{latent} \rightarrow y_{obs} \]
If we knew \(y_{latent}\), all would be fine and we could estimate OLS
Instead, we observe a realization of \(y_{latent}\) in \(y_{obs}\)
We’d like to estimate \(y_{latent,i}=x_i\beta+e_i\)
We can’t, because \(y_{latent}\) is unobserved, so the errors are unknown
What residuals can we minimize?
Instead, let’s make an assumption about the error process. The two standard choices for the errors:
\(e_i \sim normal(0, 1)\) is the probit regression model
\(e_i \sim logistic(0, \pi^2/3)\) is the logit regression model
And sigmoid \(=\) logistic: the sigmoid function is the logistic CDF
The probit model assumes standard normal errors, with density
\[\frac{1}{\sqrt{2 \pi}}\exp\left(-\frac{t^2}{2}\right)\]
and CDF
\[\int_{-\infty}^{e}\frac{1}{\sqrt{2 \pi}}\exp\left(-\frac{t^2}{2}\right)dt\]
The logit model assumes logistic errors, with density
\[\frac{\exp(e)}{\left(1+\exp(e)\right)^2}\]
and CDF
\[\frac{\exp(e)}{1+\exp(e)}\]
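A quick sanity check of the logistic expressions against R's built-in distribution functions:
e <- 0.5
exp(e) / (1 + exp(e))^2  # logistic density; matches dlogis(e)
dlogis(e)
exp(e) / (1 + exp(e))    # logistic CDF; matches plogis(e)
plogis(e)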
The choice – logit or probit – is somewhat arbitrary
Returning to the example
\(\bullet\) The mean of the error for the latent variable \(e\) is 0
\(\bullet\) The threshold \(\tau\) is 0
\(\bullet\) The variance of the error term is constant
If
\[p(y_{obs}=1|x)=p(y_{latent}>0|x)\]
then,
\[p(x\beta+\epsilon>0|x)=p(\epsilon>-x\beta|x)\]
Because the distribution of \(\epsilon\) must be symmetric about zero,
\[p(\epsilon>-x\beta|x)=p(\epsilon<x\beta|x)\]
Since \(y_{latent}\) is unobserved, we need to make an assumption about the scale of the error term.
This is the scale indeterminacy problem: whatever we choose will affect the parameter estimates.
Logit coefficients will differ from probit coefficients, with the same data, by a factor of approximately 1.81: the ratio of the standard deviations of the two error distributions, \(\pi/\sqrt{3} \approx 1.81\)
\[1.81 \times \beta_{probit} \approx \beta_{logit}\]
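A rough numerical check of this scaling in base R: rescaling the probit argument by 1.81 makes the two CDFs nearly coincide across their range:
x <- seq(-4, 4, by = 0.5)
# Differences stay within a few hundredths
round(plogis(x) - pnorm(x / 1.81), 3)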
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive %>%
mutate(
support_protest = ifelse(violent >3, 1, 0)
) %>%
select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "logit"))
# Estimate probit
probitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "probit"))
Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"),
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.34186 0.04804 -7.115 1.12e-12 ***
VIOLENT -0.55257 0.07058 -7.829 4.92e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4671.4 on 3599 degrees of freedom
Residual deviance: 4609.4 on 3598 degrees of freedom
AIC: 4613.4
Number of Fisher Scoring iterations: 4
Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "probit"),
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.21378 0.02992 -7.145 9.01e-13 ***
VIOLENT -0.33902 0.04316 -7.855 3.99e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4671.4 on 3599 degrees of freedom
Residual deviance: 4609.4 on 3598 degrees of freedom
AIC: 4613.4
Number of Fisher Scoring iterations: 4
# Create a prediction dataset with both treatment values
pred.data <- expand.grid(VIOLENT = c(0, 1))
predictions_logit <- predict(logitModel, pred.data, type = "response") %>%
  data.frame() %>%
  mutate(violent = c(0, 1))
predictions_probit <- predict(probitModel, pred.data, type = "response") %>%
  data.frame() %>%
  mutate(violent = c(0, 1))
# The difference
cat("The treatment effect for the logit model is:\n",
predictions_logit[1,1] - predictions_logit[2,1], "\n",
"The treatment effect for the probit model is:\n",
predictions_probit[1,1] - predictions_probit[2,1]
)
The treatment effect for the logit model is:
0.1251605
The treatment effect for the probit model is:
0.1251605
\[ \log p(D|\theta)=\sum_{n=1}^N \log p(y_n|\theta)=\sum_{n=1}^N\left[y_n \log\theta+(1-y_n)\log(1-\theta)\right] \]
\(\theta\) is bounded between 0 and 1, but our regression model requires a continuous, unbounded dependent variable
First, convert the probability to an odds
\[{{\theta_i}\over{1-\theta_i}}\]
The odds may approach \(\infty\), removing the upper bound; but we still have the zero restriction on the lower bound
Take the natural logarithm of the odds, resulting in the log odds: the logit transformation
\[\eta_i=\log\left[{{\theta_i}\over{1-\theta_i}}\right]\]
\[\eta \in (-\infty, \infty)\]
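The mapping is easy to see numerically; a small sketch over a grid of probabilities:
theta <- c(0.01, 0.25, 0.50, 0.75, 0.99)
odds  <- theta / (1 - theta)  # bounded below at 0, unbounded above
eta   <- log(odds)            # log odds; unbounded, same as qlogis(theta)
data.frame(theta, odds, eta)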
Setting the log odds equal to the linear predictor,
\[\log\left[{{\theta_i}\over{1-\theta_i}}\right]=x\beta\]
The inverse of this function is the probability scale.
\[p(y=1|x)={{exp(x\beta)}\over{1+exp(x\beta)}}\]
\[p(y=1|x\beta)=\theta_i={1\over{1+exp(-(x\beta))}}\]
\[p(D|\theta)=\prod_{n=1}^N\left[\frac{1}{1+\exp(-(x_n\beta))}\right]^{y_n}\left[1-\frac{1}{1+\exp(-(x_n\beta))}\right]^{1-{y_n}}\]
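This likelihood can be maximized directly. A minimal sketch using simulated data (the true coefficients -0.3 and -0.5 are invented for the example) and base R's optim(); the estimates should closely match glm():
set.seed(1)
x <- rbinom(1000, 1, 0.5)                      # hypothetical binary predictor
y <- rbinom(1000, 1, plogis(-0.3 - 0.5 * x))   # outcomes from the invented true model
# Negative log-likelihood of the logit model
negLL <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * log(plogis(eta)) + (1 - y) * log(1 - plogis(eta)))
}
optim(c(0, 0), negLL)$par                            # MLE by numerical optimization
coef(glm(y ~ x, family = binomial(link = "logit")))  # should agree closely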
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive %>% mutate( support_protest = ifelse(violent >3, 1, 0)) %>%
select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat,
family = binomial(link = "logit"))
logitModel
Call: glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"),
data = dat)
Coefficients:
(Intercept) VIOLENT
-0.3419 -0.5526
Degrees of Freedom: 3599 Total (i.e. Null); 3598 Residual
Null Deviance: 4671
Residual Deviance: 4609 AIC: 4613